pdfs are the most complicated data structures to clean.
PyPDF2 reads the pdf and eases finding the right page.
import tabula, camelot, PyPDF2def extract_page_text(pdf_path):''' extract_page_text function opens a pdf an reads all the text. It is used to search specific pages '''def extract_page_text(pdf_path):withopen(pdf_path, 'rb') as pdf_file: reader = PyPDF2.PdfReader(pdf_file) page_texts = [page.extract_text() for page in reader.pages]return page_texts
camelot and tabula specialize in managing pdf Tables.
The pdf process
import tabula, camelot, PyPDF2, requestsimport pandas as pdresponse = requests.get(get, stream=True)file_name =f"{folder_spec}/gdp_file.pdf"withopen( file_name , "wb") as pdf_file:for chunk in response.iter_content(chunk_size=8192):if chunk: pdf_file.write(chunk)# Find the page numberpage_texts = extract_page_text(file_name)table_2_pages = [i +1for i, text inenumerate(page_texts) if"TABLE 2"in text ]# Iterate over all pages with the word TABLE 2 and extract them as tables inside a list "tables"if table_2_pages : tables = []for page in table_2_pages: page_tables = camelot.read_pdf(file_name, pages=str(page)) tables.extend(page_tables) df0 = pd.DataFrame()for x inrange(0,len(tables)) :# Open the Table df = tables[x].df# Cleaning df0 = pd.concat([df0, df], axis=0)
The pdf process: step-by-step
Opening the file
import camelot, PyPDF2import pandas as pddef extract_page_text(pdf_path):withopen(pdf_path, 'rb') as pdf_file: reader = PyPDF2.PdfReader(pdf_file) page_texts = [page.extract_text() for page in reader.pages]return page_textsfile_name =f"session_2_files/gdp_file.pdf"# Find the page numberpage_texts = extract_page_text(file_name)table_2_pages = [i +1for i, text inenumerate(page_texts) if"TABLE 2"in text ]print(table_2_pages)
[6, 7, 8]
Data Cleaning
Working dataset
National Budget 10 Year Monthly Data (xlsx)
Open the file with pandas
import pandas as pdfile="session_2_files/MonthlyData.xlsx"df = pd.read_excel(file)df
Unnamed: 0
JUL 2012
AUG 2012
SEP 2012
OCT 2012
NOV 2012
DEC 2012
JAN 2013
FEB 2013
MAR 2013
...
MAY 2024
JUN 2024
JUL 2024
AUG 2024
SEP 2024
OCT 2024
NOV 2024
DEC 2024
JAN 2025
FEB 2025
0
TAX REVENUE (a+b+c+d)
93040348
86483176
78642733
82883677
92179630
133238356
114824826
96878244
104590342
...
241613741
204402400
255231585
193503688
167475359
236023998
230465826
208805894
278995125
241149985
1
a. TAXES ON PROPERTY
5365152
3671873
4019217
4465873
14157876
30151541
11907058
5451162
6462140
...
21913491
6984546
9623902
6195951
9551582
7973590
21072882
14207695
20146131
23806222
2
b. TAXES ON GOODS & SEERVICES
37405548
39184491
35059983
36258228
36706457
39313639
58119469
51401840
55054774
...
141071438
127777154
183129726
117635744
102715248
161040855
124113322
121700941
189330813
165977687
3
i. GENERAL TAXES ON GOODS & SERVICES
31597088
32422854
30854576
31307502
29064451
29947164
37444634
36096632
26008730
...
116292674
110258234
168021505
111749001
95408539
142005686
105226938
107198452
165107832
112790078
4
VALUE ADDED TAX
0
0
0
0
0
0
0
0
0
...
107698722
99208190
154091380
100027953
85266170
132555478
95441872
95695606
152428688
103161582
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
83
ACQUISITION OF FIXED ASSETS
9927149
21651950
21174562
5472129
10873272
27465690
20477064
19060318
11022186
...
16851928
19112442
37315394
36110379
30875113
21811320
25377276
19164010
13350340
9418346
84
LAND
35000
610336
25929
30100
300000
0
263148
151352
0
...
0
0
0
0
0
188341
18356
684710
135000
1090526
85
TOTAL EXPENDITURE
133417164
155961204
160357472
144566394
149251682
202644733
172360192
136265908
154900104
...
261681048
270588087
335917578
279538178
252228611
344528344
335545120
291319766
305855273
234356200
86
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
...
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
87
SURPLUS/DEFICIT
-25056364
-57493129
-67011839
-47832158
-30039299
-44872480
-47925435
-34056397
-34326090
...
25793212
-35625802
-59079363
-61498551
-64538790
-88404828
-82321717
-42208300
-3793834
58580747
88 rows × 153 columns
Clean up column names
df.columns = df.columns.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8').str.lower()for col in df.columns :if"unnamed: 0"in col : df.rename(columns = {col : 'variable'}, inplace=True)df.variable = df.variable.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8').str.lower().str.strip()df = df.dropna(subset='variable')df
Nighttime satellite imagery provides a valuable proxy for economic activity, electrification, and urban growth — particularly in areas with limited or unreliable statistical data.
It measures the intensity of light emissions captured by satellites orbiting the Earth.
Data are typically available at fine spatial and temporal resolutions, making it a powerful complement to conventional economic indicators.
Why It’s Complex
Working with satellite data —especially NASA’s Nighttime Lights (VIIRS) data— involves multiple steps and tools.
This is one of the most complex code blocks in our workflow, combining geospatial processing, data scraping, and time-series construction.
Workflow Overview
Identify target locations
Define the geographic boundaries.
Create a shapefile or bounding box for the selected areas.
Download raw .h5 map files
Retrieve satellite imagery files directly from NASA’s repository.
Often requires authentication and automated scraping (via requests or NASA’s API).
Each file corresponds to a specific day/month.
Workflow Overview (continued)
Open and read the .h5 files
Use libraries like h5pygdal to extract pixel values.
Convert the raster data into usable arrays or DataFrames.
Aggregate data over the selected geographic units (e.g., county or grid cell).
Clean and prepare the series
Handle missing pixels and noise (cloud cover, stray light, etc.).
Average or sum pixel intensities to build a monthly or yearly time series.
Save a tidy dataset ready for merging with socioeconomic indicators.
Google Trends Data
First Estimation
Activity
Appendix
Camelot
Running camelot may need some additional Libraries